arXiv:1501.06727v2 [stat.ML] 18 Nov 2015 


Factorization, Inference and Parameter Learning 
in Discrete AMP Chain Graphs 


Jose M. Peña 

ADIT, IDA, Linköping University, SE-58183 Linköping, Sweden 
jose.m.pena@liu.se 


Abstract. We address some computational issues that may hinder the 
use of AMP chain graphs in practice. Specifically, we show how a discrete 
probability distribution that satisfies all the independencies represented 
by an AMP chain graph factorizes according to it. We show how this 
factorization makes it possible to perform inference and parameter learning 
efficiently, by adapting existing algorithms for Markov and Bayesian 
networks. Finally, we turn our attention to another issue that may hinder 
the use of AMP CGs, namely the lack of an intuitive interpretation of 
their edges. We provide one such interpretation. 


1 Introduction 

Chain graphs (CGs) are graphs with possibly directed and undirected edges, and 
no semidirected cycle. They have been extensively studied as a formalism to 
represent independence models, because they can model symmetric and asymmetric 
relationships between random variables. There are three different interpretations 
of CGs as independence models: The Lauritzen-Wermuth-Frydenberg (LWF) 
interpretation [6], the multivariate regression (MVR) interpretation [3], and the 
Andersson-Madigan-Perlman (AMP) interpretation [1]. No interpretation 
subsumes another. 
In this paper, we focus on AMP CGs. Despite being much more expressive 
than Markov and Bayesian networks, AMP CGs have not enjoyed much 
success in the literature or in practice. We believe this is mainly due to two 
reasons. First, it is not known how to perform inference and parameter learning 
for AMP CGs efficiently, because it is not known how to factorize a probability 
distribution that satisfies all the independencies represented by an AMP CG. 
Compare this situation to that of LWF CGs, where such a factorization exists 
[4, Theorem 4.1] and thus inference can be performed efficiently [2, Section 6.5]. 
Second, AMP CGs do not appeal to intuition: Whereas the directed edges in a 
Bayesian network may be interpreted as causal relationships and the undirected 
edges in a Markov network as correlation relationships, it is not clear how to 
combine these two interpretations to produce an intuitive interpretation of the 
edges in an AMP CG. 

In this paper, we address the two problems mentioned above. First, we 
introduce a factorization for AMP CGs and show how it makes it possible to perform 
inference and parameter learning efficiently, by adapting existing algorithms for 
Markov and Bayesian networks. Second, we propose an intuitive interpretation 
of the edges in an AMP CG. We start with some notation and definitions. 

2 Preliminaries 

Unless otherwise stated, all the graphs and probability distributions in this paper 
are defined over a finite set of discrete random variables $V$. We use uppercase 
letters to denote random variables and lowercase letters to denote their states. 
The elements of $V$ are not distinguished from singletons. If a graph $G$ contains 
an undirected or directed edge between two nodes $V_1$ and $V_2$, then we write 
that $V_1 - V_2$ or $V_1 \to V_2$ is in $G$. The parents of a set of nodes $X$ of $G$ is the set 
$Pa_G(X) = \{V_1 \mid V_1 \to V_2$ is in $G$, $V_1 \notin X$ and $V_2 \in X\}$. The adjacents of $X$ is the set 
$Ad_G(X) = \{V_1 \mid V_1 \leftarrow V_2$, $V_1 \to V_2$ or $V_1 - V_2$ is in $G$, $V_1 \notin X$ and $V_2 \in X\}$. A route 
between a node $V_1$ and a node $V_n$ in $G$ is a sequence of (not necessarily distinct) 
nodes $V_1, \ldots, V_n$ st $V_i \in Ad_G(V_{i+1})$ for all $1 \le i < n$. If the nodes in the route 
are all distinct, then the route is called a path. A route is called descending if 
$V_i \to V_{i+1}$ or $V_i - V_{i+1}$ is in $G$ for all $1 \le i < n$. A route is called strictly descending 
if $V_i \to V_{i+1}$ is in $G$ for all $1 \le i < n$. The descendants of a set of nodes $X$ of $G$ is 
the set $De_G(X) = \{V_n \mid$ there is a descending route from $V_1$ to $V_n$ in $G$, $V_1 \in X$ 
and $V_n \notin X\}$. The non-descendants of $X$ is the set $Nd_G(X) = V \setminus X \setminus De_G(X)$. 
The strict ascendants of $X$ is the set $Sa_G(X) = \{V_1 \mid$ there is a strictly descending 
route from $V_1$ to $V_n$ in $G$, $V_1 \notin X$ and $V_n \in X\}$. A route $V_1, \ldots, V_n$ in $G$ is called 
a cycle if $V_n = V_1$. Moreover, it is called a semidirected cycle if $V_n = V_1$, $V_1 \to V_2$ 
is in $G$ and $V_i \to V_{i+1}$ or $V_i - V_{i+1}$ is in $G$ for all $1 < i < n$. An AMP chain graph 
(AMP CG) is a graph whose every edge is directed or undirected st it has no 
semidirected cycles. An AMP CG with only directed edges is called a directed 
and acyclic graph (DAG), whereas an AMP CG with only undirected edges is 
called an undirected graph (UG). A set of nodes of an AMP CG $G$ is connected 
if there exists a route in the CG between every pair of nodes in the set st all the 
edges in the route are undirected. A connectivity component of $G$ is a maximal 
(wrt set inclusion) connected set of nodes. The connectivity components of $G$ 
are denoted as $Cc(G)$, whereas $Cc_G(X)$ denotes the connectivity component to 
which the node $X$ belongs. A set of nodes of $G$ is complete if there exists an 
undirected edge between every pair of nodes in the set. The complete sets of 
nodes of $G$ are denoted as $Cs(G)$. A clique of $G$ is a maximal (wrt set inclusion) 
complete set of nodes. The cliques of $G$ are denoted as $Cl(G)$. The subgraph of 
$G$ induced by a set of its nodes $X$, denoted as $G_X$, is the graph over $X$ that has 
all and only the edges in $G$ whose both ends are in $X$. 
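
The definitions of parents and connectivity components above can be made concrete with a small sketch. The edge-set representation (pairs of node names) and the function names `parents` and `connectivity_components` are assumptions made for illustration only; they are not part of the paper.

```python
from collections import defaultdict

def parents(directed, nodes):
    """Pa_G(X): nodes outside X that have a directed edge into X."""
    nodes = set(nodes)
    return {a for (a, b) in directed if b in nodes and a not in nodes}

def connectivity_components(all_nodes, undirected):
    """Cc(G): maximal sets of nodes joined by routes of undirected edges."""
    neighbours = defaultdict(set)
    for a, b in undirected:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in sorted(all_nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search along undirected edges only
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(frozenset(component))
    return components

# Hypothetical example: the AMP CG A -> B - C.
nodes = {"A", "B", "C"}
directed = {("A", "B")}
undirected = {("B", "C")}
components = connectivity_components(nodes, undirected)
```

For this example graph, the components are {A} and {B, C}, and the parents of {B, C} are {A}.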

We now recall the semantics of AMP CGs. A node $B$ in a path $\rho$ in an AMP 
CG $G$ is called a triplex node in $\rho$ if $A \to B \leftarrow C$, $A \to B - C$, or $A - B \leftarrow C$ is a 
subpath of $\rho$. Moreover, $\rho$ is said to be $Z$-open with $Z \subseteq V$ when 

- every triplex node in $\rho$ is in $Z \cup Sa_G(Z)$, and 

- every non-triplex node $B$ in $\rho$ is outside $Z$, unless $A - B - C$ is a subpath of 
$\rho$ and $Pa_G(B) \setminus Z \neq \emptyset$. 



Let $X$, $Y$ and $Z$ denote three disjoint subsets of $V$. When there is no $Z$-open 
path in an AMP CG $G$ between a node in $X$ and a node in $Y$, we say that $X$ is 
separated from $Y$ given $Z$ in $G$ and denote it as $X \perp_G Y \mid Z$. The independence 
model represented by $G$ is the set of separations $X \perp_G Y \mid Z$. The independence 
model represented by $G$ under marginalization of some nodes $L \subseteq V$ is the set 
of separations $X \perp_G Y \mid Z$ with $X, Y, Z \subseteq V \setminus L$. Finally, we denote by $X \perp_p Y \mid Z$ 
that $X$ is independent of $Y$ given $Z$ in a probability distribution $p$. We say that 
$p$ is Markovian wrt an AMP CG $G$ when, for all $X$, $Y$ and $Z$ disjoint subsets of 
$V$, if $X \perp_G Y \mid Z$ then $X \perp_p Y \mid Z$. 

3 Factorization 

A probability distribution $p$ is Markovian wrt an AMP CG $G$ iff the following 
three properties hold for all $C \in Cc(G)$ [1, Theorem 2]: 

- C1: $C \perp_p Nd_G(C) \setminus Cc_G(Pa_G(C)) \mid Cc_G(Pa_G(C))$. 

- C2: $p(C \mid Cc_G(Pa_G(C)))$ is Markovian wrt $G_C$. 

- C3*: For all $D \subseteq C$, $D \perp_p Cc_G(Pa_G(C)) \setminus Pa_G(D) \mid Pa_G(D)$. 

Then, C1 implies that $p$ factorizes as 

$$p = \prod_{C \in Cc(G)} p(C \mid Cc_G(Pa_G(C))).$$

The authors of [1, p. 50] note that if $p$ were strictly positive and $G$ were a 
LWF CG, then each conditional distribution above would factorize further into 
a product of potentials over certain subsets of the nodes in $C \cup Pa_G(C)$, as shown 
in [4, Theorem 4.1]. However, the authors state that no such further factorization 
appears to hold in general if $G$ is an AMP CG. We show that this is not true 
if $p$ is strictly positive. Specifically, C2 together with [6, Theorems 3.7 and 3.9] 
imply that 

$$p(C \mid Cc_G(Pa_G(C))) = \prod_{K \in Cs(G_C)} \psi(K, Cc_G(Pa_G(C))).$$

However, one can show that $\psi(K, Cc_G(Pa_G(C)))$ is actually a function of $K \cup 
Pa_G(K)$, i.e. $\psi(K, Cc_G(Pa_G(C))) = \psi(K, Pa_G(K))$. It suffices to recall from 
the proof of [6, Theorem 3.9] how $\psi(K, Cc_G(Pa_G(C)))$ can be obtained from 
$p(C \mid Cc_G(Pa_G(C)))$, a method also known as canonical parameterization [5, 
Section 4.4.2.1]. Specifically, let $\phi(K, Cc_G(Pa_G(C))) = \log \psi(K, Cc_G(Pa_G(C)))$. 
Choose a fixed but arbitrary state $k^*$ of $K$. Then, 

$$\phi(k, Cc_G(Pa_G(C))) = \sum_{q \subseteq k} (-1)^{|k \setminus q|} \log p(q, \bar{q}^* \mid Cc_G(Pa_G(C)))$$

where $\bar{q}^*$ denotes the elements of $k^*$ corresponding to the elements of $K \setminus Q$. Now, 
note that $p(q, \bar{q}^* \mid Cc_G(Pa_G(C))) = p(q, \bar{q}^* \mid Pa_G(K))$ by C3*, because $Q \subseteq K$. 
Then, $\psi(K, Cc_G(Pa_G(C)))$ is actually a function of $K \cup Pa_G(K)$. 


Putting together the results above, we have that $p$ factorizes as 

$$p = \prod_{C \in Cc(G)} \prod_{K \in Cs(G_C)} \psi(K, Pa_G(K)) = \prod_{C \in Cc(G)} \prod_{K \in Cl(G_C)} \psi(K, Pa_G(K)). \quad (1)$$

Note that the well-known factorizations induced by DAGs and UGs (see [6, 
Sections 3.2.1 and 3.2.2]) are special cases of Equation (1). 
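
As a small worked illustration of Equation (1), consider the AMP CG $A \to B - C$, which is also used as an example in Section 6.2. The derivation below only instantiates the definitions from Section 2; treat it as a sketch.

```latex
% Connectivity components, cliques and parents of G: A -> B - C
Cc(G) = \{\{A\}, \{B, C\}\}, \qquad
Cl(G_{\{A\}}) = \{\{A\}\}, \qquad
Cl(G_{\{B,C\}}) = \{\{B, C\}\},
\\
Pa_G(\{A\}) = \emptyset, \qquad Pa_G(\{B, C\}) = \{A\},
\\
% Hence Equation (1) reads
p(A, B, C) = \psi(A) \, \psi(B, C, A).
```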

4 Parameter Learning 

The factorization in Equation (1) enables us to perform parameter learning for 
AMP CGs efficiently by deploying the iterative proportional fitting procedure 
(IPFP) [5, Section 19.5.7], which returns the maximum likelihood estimates of 
the entries of the potentials for some given data. Specifically, we first simplify 
further the factorization by multiplying its potentials until no potential domain 
is included in another potential domain. Let $Q_1, \ldots, Q_n$ denote the potential 
domains in the resulting factorization. Note that each domain $Q_i$ is of the form 
$K \cup Pa_G(K)$ with $K \in Cl(G_C)$ and $C \in Cc(G)$. Then, we run the IPFP per se: 

1 For each potential $\psi(Q_i)$ 
2     Set $\psi^0(Q_i) = 1$ 
3 Repeat until convergence 
4     For each potential $\psi^t(Q_i)$ 
5         Set $\psi^{t+1}(Q_i) = \psi^t(Q_i) \, p_e(Q_i) / p^t(Q_i)$ 

where $p^t = \prod_{i=1}^{n} \psi^t(Q_i)$, and $p_e$ is the empirical probability distribution over $V$ 
obtained from the given data. 
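
The following is a minimal sketch of iterative proportional fitting over a set of potential domains. Several choices are assumptions made for illustration and are not taken from the paper: all variables are binary, the joint distribution is obtained by brute-force enumeration, the product of potentials is normalized explicitly, and potentials are updated one after another within each sweep. The function name `ipfp` is hypothetical.

```python
from itertools import product

def ipfp(variables, domains, empirical, iterations=50):
    """Fit one potential per domain so that the model marginals match
    the empirical marginals on every domain.

    variables: list of variable names (binary states 0/1, an assumption).
    domains:   list of tuples of variable names (the Q_i).
    empirical: dict mapping full state tuples to probabilities (p_e).
    """
    states = list(product([0, 1], repeat=len(variables)))
    index = {v: i for i, v in enumerate(variables)}

    def project(state, domain):
        return tuple(state[index[v]] for v in domain)

    def marginal(joint, domain):
        result = {}
        for state, prob in joint.items():
            key = project(state, domain)
            result[key] = result.get(key, 0.0) + prob
        return result

    # Initialise every potential entry to 1.
    potentials = [{k: 1.0 for k in product([0, 1], repeat=len(d))}
                  for d in domains]

    def current_joint():
        unnorm = {}
        for s in states:
            value = 1.0
            for d, pot in zip(domains, potentials):
                value *= pot[project(s, d)]
            unnorm[s] = value
        total = sum(unnorm.values())
        return {s: v / total for s, v in unnorm.items()}

    for _ in range(iterations):
        for d, pot in zip(domains, potentials):
            model = marginal(current_joint(), d)
            target = marginal(empirical, d)
            for key in pot:
                if model[key] > 0:
                    # Elementwise update: psi <- psi * p_e(Q_i) / p(Q_i)
                    pot[key] *= target.get(key, 0.0) / model[key]
    return current_joint()
```

A fixed number of sweeps stands in for a convergence test here; a real implementation would stop when the potentials change by less than some tolerance and would use inference rather than enumeration to obtain the model marginals.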


5 Inference 

The factorization in Equation (1) also enables us to perform inference in AMP 
CGs efficiently by deploying the algorithm for inference in DAGs developed by 
[7], and upon which most other inference algorithms build. Specifically, we start 
by transforming $G$ into its moral graph $G^m$ by running the procedure below. 
This procedure differs from the one in [7], because $G$ is an AMP CG and not 
a DAG. In any case, the moralization procedure in [7] is a special case of the 
procedure below. 

1 Set $G^m = G$ 
2 For each connectivity component $C \in Cc(G)$ 
3     For each clique $K \in Cl(G_C)$ 
4         Add the edge $X \to Y$ to $G^m$ for all $X \in Pa_G(K)$ and $Y \in K$ 
5         Add the edge $X - Y$ to $G^m$ for all $X, Y \in Pa_G(K)$ st $X \neq Y$ 
6 Replace all the directed edges in $G^m$ with undirected edges 
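
The moralization procedure above can be sketched as follows. The representation (edges as pairs, the moral graph as a set of two-element frozensets) and the function name `moralize` are assumptions for illustration. The cliques of each $G_C$ are assumed to be supplied by the caller, and since the last step makes every edge undirected, the sketch stores all edges as unordered pairs from the start.

```python
from itertools import combinations

def moralize(undirected, directed, cliques):
    """Return the edges of the moral graph G^m as a set of frozensets.

    cliques: iterable of node sets, one per clique K of each G_C
             (finding them is assumed to be done elsewhere).
    """
    # Start from G with all edges treated as undirected.
    moral = {frozenset(e) for e in undirected}
    moral |= {frozenset(e) for e in directed}
    for clique in cliques:
        clique = set(clique)
        # Pa_G(K): nodes outside K with a directed edge into K.
        pa = {a for (a, b) in directed if b in clique and a not in clique}
        # Connect every parent of K with every node of K.
        moral |= {frozenset((x, y)) for x in pa for y in clique}
        # Connect the parents of K with each other.
        moral |= {frozenset(pair) for pair in combinations(sorted(pa), 2)}
    return moral

# Hypothetical example: A -> C, B -> D, C - D.
moral_edges = moralize(
    undirected={("C", "D")},
    directed={("A", "C"), ("B", "D")},
    cliques=[{"C", "D"}],
)
```

In this example $K = \{C, D\}$ has parents $\{A, B\}$, so $K \cup Pa_G(K)$ becomes a complete set in the moral graph.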



The reason why $G^m$ has the edges it has will become clear later. We 
continue by transforming $G^m$ into a triangulated graph $G^t$, and sorting its cliques 
to satisfy the so-called running intersection property. The procedure below 
accomplishes these two objectives. A UG is triangulated when every cycle in it 
contains a chord, i.e. an edge between two non-consecutive nodes in the cycle. 
The cliques of a triangulated graph can be ordered as $Q_1, \ldots, Q_n$ so that for all 
$1 < j \le n$, $Q_j \cap (Q_1 \cup \ldots \cup Q_{j-1}) \subseteq Q_i$ for some $1 \le i < j$. This is known as the 
running intersection property (RIP). 

1 Set $G^t = G^m$ 
2 Repeat until all the nodes in $G^t$ are marked 
3     Select an unmarked node in $G^t$ with the largest number of marked 
      neighbours 
4     Mark the node and make its marked neighbours form a complete set 
      in $G^t$ by adding undirected edges 
5     Save the node plus its marked neighbours as a candidate clique 
6 Remove every candidate clique that is included in another 
7 Label every clique with the last iteration that marked one of its nodes 
8 Sort the cliques in ascending order of their labels 
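
The running intersection property defined above is easy to check for a given clique ordering. The helper below is a sketch for illustration only; the name `satisfies_rip` and the representation of cliques as Python sets are assumptions.

```python
def satisfies_rip(cliques):
    """Check the running intersection property for the ordering
    Q_1, ..., Q_n given as a list of sets."""
    for j in range(1, len(cliques)):
        earlier = set().union(*cliques[:j])
        separator = cliques[j] & earlier  # S_j = Q_j n (Q_1 u ... u Q_{j-1})
        # S_j must be contained in a single earlier clique.
        if not any(separator <= cliques[i] for i in range(j)):
            return False
    return True
```

For instance, the ordering {A, B, C}, {A, C, D} satisfies the property, because the separator {A, C} is contained in the first clique.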

Finally, let $Q_1, \ldots, Q_n$ denote the ordering of the cliques of $G^t$ returned by 
the procedure above. Let $S_j = Q_j \cap (Q_1 \cup \ldots \cup Q_{j-1})$ and $R_j = Q_j \setminus S_j$. Note that 
for every $K \in Cl(G_C)$ with $C \in Cc(G)$, there is some $Q_i$ st $K \cup Pa_G(K) \subseteq Q_i$, 
because the moralization procedure above made $K \cup Pa_G(K)$ a complete set in 
$G^m$ and thus in $G^t$. Then, 

$$p(V) = \prod_{C \in Cc(G)} \prod_{K \in Cl(G_C)} \psi(K, Pa_G(K)) = \prod_{i=1}^{n} \phi(Q_i) \quad (2)$$

and thus 

$$p(V) = f([Q_1 \cup \ldots \cup Q_{n-1}] \setminus S_n, S_n) \, g(S_n, R_n)$$

and thus 

$$R_n \perp_p [Q_1 \cup \ldots \cup Q_{n-1}] \setminus S_n \mid S_n$$

by [6, p. 29], and thus 

$$p(V) = p(Q_1 \cup \ldots \cup Q_{n-1}) \, p(R_n \mid Q_1 \cup \ldots \cup Q_{n-1}) = p(Q_1 \cup \ldots \cup Q_{n-1}) \, p(R_n \mid S_n). \quad (3)$$


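Note also that 

$$p(Q_1 \cup \ldots \cup Q_{n-1}) = \sum_{r_n} p(Q_1 \cup \ldots \cup Q_{n-1}, r_n) = \Big[\prod_{i=1}^{n-1} \phi(Q_i)\Big] \sum_{r_n} \phi(S_n, r_n). \quad (4)$$

Then, Equations (2)-(4) imply that 

$$p(R_n \mid S_n) = \frac{\phi(S_n, R_n)}{\sum_{r_n} \phi(S_n, r_n)}.$$

Note that $S_n \subseteq Q_j$ for some $1 \le j < n$ by the RIP. Then, we replace $\phi(Q_j)$ with 
$\phi(Q_j) \sum_{r_n} \phi(S_n, r_n)$, after which Equation (4) implies that 

$$p(Q_1 \cup \ldots \cup Q_{n-1}) = \prod_{i=1}^{n-1} \phi(Q_i).$$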

We repeat the steps above for $p(Q_1 \cup \ldots \cup Q_{n-1})$ and so we obtain $p(R_i \mid S_i)$ for 
all $1 \le i \le n$. Now, note that $S_1 = \emptyset$ and, thus, $p(Q_1) = p(R_1 \mid S_1)$. Moreover, since 
$S_2 \subseteq Q_1$ by the RIP, then 

$$p(S_2) = \sum_{q_1 \setminus s_2} p(S_2, q_1 \setminus s_2)$$

and thus 

$$p(Q_2) = p(R_2 \mid S_2) \, p(S_2).$$

We repeat the steps above for $Q_3, \ldots, Q_n$ and so we obtain $p(Q_i)$ for all $1 \le i \le 
n$. To obtain $p(Q_i \mid o)$ where $o$ denotes some observations or evidence, we first 
remove all the entries of $\phi(Q_j)$ that are inconsistent with $o$ for all $1 \le j \le n$, 
then we repeat the steps above to get $p(Q_i, o)$ and, finally, we normalize by 
$p(o) = \sum_{q_i} p(q_i, o)$. To obtain $p(X \mid o)$ where $X \not\subseteq Q_i$ for all $1 \le i \le n$, we compute 
$p(x, o)$ for all $x$ as if $\{x, o\}$ were the observations and, then, we normalize by 
$p(o) = \sum_x p(x, o)$. 


6 Error AMP CGs 

So far in this article, we have shown how an AMP CG factorizes a probability 
distribution, and how this helps in performing parameter learning and inference 
efficiently. We believe that our findings solve some computational issues that have 
hindered the use of AMP CGs in practice. In this section, we turn our attention 
to another issue that may have also hindered the use of AMP CGs, namely the 
lack of an intuitive interpretation of their edges. Whereas the directed edges 
in a DAG may be interpreted as causal relationships and the undirected edges 
in an UG as correlation relationships, it is not clear how to combine these two 
interpretations to produce an intuitive interpretation of the edges in an AMP 
CG. We propose here a way to do it by adapting to discrete AMP CGs the 
interpretation for Gaussian AMP CGs presented in [1, Section 5] and further 
studied in [8, Section 3]. Specifically, we propose to interpret the directed edges 
in an AMP CG as causal relationships. In other words, the parents of a node 
represent its causal mechanism. We propose to assume that this mechanism is 
deterministic but it may sometimes work erroneously. We propose to interpret 
the undirected edges in the AMP CG as the correlation structure of the errors 
of the causal mechanisms of the different nodes. To show the validity of this 
interpretation, we will first modify the AMP CG by adding a deterministic node 
for each original node to represent explicitly the occurrence or not of an error in 
its causal mechanism and, then, we will show that the original and the modified 
AMP CGs are equivalent in some sense. We call the modified CG an error AMP 
(EAMP) CG. Since an EAMP CG is an AMP CG with deterministic nodes, we 
discuss these first. 



Fig. 1. An AMP CG and its corresponding EAMP CG. 


6.1 AMP CGs with Deterministic Nodes 

We say that a node $A$ of an AMP CG is determined by some $Z \subseteq V$ when $A \in Z$ 
or $A$ is a function of $Z$ in each probability distribution that is Markovian wrt 
the CG. In that case, we also say that $A$ is a deterministic node. We use $D(Z)$ 
to denote all the nodes that are determined by $Z$. From the point of view of 
the separations in an AMP CG, a node outside the conditioning set of a 
separation that is determined by it has the same effect as if the node were actually 
in the conditioning set. We extend accordingly the definition of separation for 
AMP CGs to the case where deterministic nodes may exist. Given an AMP CG 
$G$, a path $\rho$ in $G$ is said to be $Z$-open when 

- every triplex node in $\rho$ is in $D(Z) \cup Sa_G(D(Z))$, and 

- no non-triplex node $B$ in $\rho$ is in $D(Z)$, unless $A - B - C$ is a subpath of $\rho$ 
and $Pa_G(B) \setminus D(Z) \neq \emptyset$. 

6.2 EAMP CGs 

The EAMP CG $H$ corresponding to an AMP CG $G$ is an AMP CG over $V \cup E$, 
where $E$ denotes the error nodes. Specifically, there is an error node $E_X \in E$ for 
every node $X \in V$, and it represents whether an error in the causal mechanism 
of $X$ occurs or not. We set $Pa_H(X) = Pa_G(X) \cup E_X$ to represent that $E_X$ 
is part of the causal mechanism of $X$ in $H$. This causal mechanism works as 
follows: If $E_X = 0$ (i.e. no error) then $pa_G(X)$ determines the state of $X$ to be 
the distinguished state $x^*_{pa_G(X)}$, else $X$ may take any state but the distinguished 
one. The undirected edges in $H$ are all between error nodes, and they represent 
the correlation structure of the error nodes. Specifically, the undirected edge 
$E_X - E_Y$ is in $H$ iff the undirected edge $X - Y$ is in $G$. Note that the error 
nodes are never observed, i.e. they are latent. The procedure below formalizes 
the transformation just described. See Figure 1 for an example. 

1 Set $H = G$ 
2 For each node $X \in V$ 
3     Add the node $E_X$ and the edge $E_X \to X$ to $H$ 
4 Replace every edge $X - Y$ in $H$ st $X, Y \in V$ with an edge $E_X - E_Y$ 
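
The transformation above is short enough to sketch directly. The edge-set representation, the `E_` naming scheme for error nodes and the function name `to_eamp` are assumptions made for illustration.

```python
def to_eamp(nodes, directed, undirected):
    """Build the EAMP CG H from an AMP CG G given as edge sets.

    Returns (nodes of H, directed edges of H, undirected edges of H).
    """
    error = {x: "E_" + x for x in nodes}  # one error node per original node
    h_nodes = set(nodes) | set(error.values())
    # Keep the directed edges of G and add E_X -> X for every X.
    h_directed = set(directed) | {(error[x], x) for x in nodes}
    # Every undirected edge X - Y of G becomes E_X - E_Y.
    h_undirected = {(error[a], error[b]) for (a, b) in undirected}
    return h_nodes, h_directed, h_undirected

# Hypothetical example: G is A -> B - C.
h_nodes, h_directed, h_undirected = to_eamp(
    nodes={"A", "B", "C"},
    directed={("A", "B")},
    undirected={("B", "C")},
)
```

For $A \to B - C$, this yields the directed edges $E_A \to A$, $A \to B$, $E_B \to B$, $E_C \to C$ and the single undirected edge $E_B - E_C$.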









Now, consider a probability distribution $p(V, E)$ that is Markovian wrt the 
EAMP CG $H$. Then, 

$$p(V, E) = p(V \mid E) \, p(E) = \Big[\prod_{X \in V} p(X \mid Pa_G(X), E_X)\Big] p(E) \quad (5)$$

by C1 and C3*. Moreover, in order for the causal mechanism of $X$ in $H$ to match 
the description above, we restrict $p(X \mid Pa_G(X), E_X)$ to be of the following form: 

$$p(X \mid pa_G(X), E_X) = \begin{cases} 1 & \text{if } E_X = 0 \text{ and } X = x^*_{pa_G(X)} \\ 0 & \text{if } E_X = 0 \text{ and } X \neq x^*_{pa_G(X)} \\ q(X \mid pa_G(X)) & \text{if } E_X = 1 \end{cases} \quad (6)$$

where $q(X \mid pa_G(X))$ is an arbitrary conditional probability distribution with the 
only constraints that $q(X \mid pa_G(X)) = 0$ if $X = x^*_{pa_G(X)}$, and $q(X \mid pa_G(X)) > 0$ 
otherwise. The first constraint follows from the description above of the causal 
mechanism of $X$ in $H$, whereas the second is necessary for $p(V)$ being strictly 
positive. Note that $E_X$ is determined by $Pa_G(X) \cup X$. Specifically, if $X = x^*_{pa_G(X)}$ 
then $E_X = 0$, else $E_X = 1$. Then, $E$ is determined by $V$. Hereinafter, when we 
say that a probability distribution is Markovian wrt an EAMP CG, it should be 
understood that it also satisfies the constraint in Equation (6). 
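
The constraint in Equation (6) can be sketched for one node and one fixed parent configuration as follows. The function names `mechanism` and `infer_error`, and the dictionary layout, are hypothetical and used only for illustration.

```python
def mechanism(states, distinguished, q):
    """Table p(X | pa_G(X), E_X) for one fixed parent configuration.

    states:        possible states of X.
    distinguished: the state x* that X takes when no error occurs.
    q:             dict state -> probability used when E_X = 1; it must put
                   zero mass on x* and positive mass on every other state.
    """
    if q.get(distinguished, 0.0) != 0.0:
        raise ValueError("q must assign probability 0 to the distinguished state")
    if any(q.get(x, 0.0) <= 0.0 for x in states if x != distinguished):
        raise ValueError("q must be positive on all non-distinguished states")
    return {
        0: {x: 1.0 if x == distinguished else 0.0 for x in states},  # no error
        1: {x: q.get(x, 0.0) for x in states},                       # error
    }

def infer_error(x, distinguished):
    """E_X is determined by X and its parents: 0 iff X takes x*."""
    return 0 if x == distinguished else 1

table = mechanism(["x0", "x1", "x2"], "x0", {"x1": 0.25, "x2": 0.75})
```

The second function mirrors the observation above that $E_X$ is determined by $Pa_G(X) \cup X$.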

We assume that $p(E)$ is strictly positive, as a way to ensure that $p(V)$ is 
strictly positive. This together with the fact that $p(E)$ is Markovian wrt $H_E$, 
which follows from $p(V, E)$ being Markovian wrt $H$, implies that $p(E)$ factorizes 
as shown in Equation (1) and, thus, Equation (5) becomes 

$$p(V, E) = \Big[\prod_{X \in V} p(X \mid Pa_G(X), E_X)\Big] \Big[\prod_{E_C \in Cc(H_E)} \prod_{E_K \in Cl(H_{E_C})} \phi(E_K)\Big]. \quad (7)$$

Thus, it is clear that the EAMP CG $H$ can be interpreted as we wanted: Each 
node is controlled by the causal mechanism specified in the AMP CG $G$, the 
mechanism is deterministic if no error occurs and it is random otherwise, and the 
errors of the different mechanisms obey the correlation structure specified in $G$. 
To see the last point, note that $E_C \in Cc(H_E)$ iff $C \in Cc(G)$, and $E_K \in Cl(H_{E_C})$ 
iff $K \in Cl(G_C)$. Thus, $H$ somehow keeps the structural information in $G$. To 
make this claim more specific, note that the independence model represented by 
$G$ coincides with that represented by $H$ under marginalization of the error nodes 
which, recall from above, are latent [8, Theorem 1] (see footnote 1). Recall that the independence 

1 Unlike in this work, $V$ is a Gaussian random variable in [8]. However, that is irrelevant 
in the proof of [8, Theorem 1]. The proof builds upon the following two properties 
which, as we show, also hold for the framework in this work: 

- A node $E_X \in E$ is determined by some $Z \subseteq V$ iff $Pa_G(X) \cup X \subseteq Z$. The if part follows 
from the fact shown above that $E_X$ is determined by $Pa_G(X) \cup X$. To see the only 
if part, assume to the contrary that $Z$ determines $E_X$ but $Pa_G(X) \cup X \not\subseteq Z$. Then, 
$X \notin Z$ or there is some $Y \in Pa_G(X) \setminus Z$. If $X \notin Z$, then let $H'$ be the EAMP CG 
over $V \cup E$ whose only edge is $E_X \to X$, and let $p'$ be a probability distribution 




model represented by $H$ can be read off as shown in Section 6.1. Note that the fact that 
the independence model represented by $G$ coincides with that represented by $H$ 
under marginalization of the error nodes implies that the probability distribution 
resulting from marginalizing $E$ out of a distribution $p(V, E)$ that is Markovian 
wrt $H$ is Markovian wrt $G$ and, thus, it factorizes as shown in Equation (1). 
Specifically, recall that $E$ is determined by $V$ and, thus, $p(V, E)$ is actually a 
function of $V$. Then, it suffices to set each potential $\psi(K, Pa_G(K))$ in Equation 
(1) equal to the following product of the terms in Equation (7): 

$$\psi(K, Pa_G(K)) = \Big[\prod_{X \in K} p(X \mid Pa_G(X), E_X)\Big] \phi(E_K)$$

bearing in mind that if $X$ belongs to several cliques $K$, then $p(X \mid Pa_G(X), E_X)$ 
is assigned to only one (any) of the potentials $\psi(K, Pa_G(K))$. For instance, the 
following is a valid assignment for the AMP and EAMP CGs in Figure 1: 


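$$\psi(A) = p(A \mid E_A) \, \phi(E_A)$$
$$\psi(B, A) = p(B \mid A, E_B) \, \phi(E_B)$$
$$\psi(C, D, A, B) = p(C \mid A, E_C) \, \phi(E_C, E_D)$$
$$\psi(C, F, A) = p(F \mid E_F) \, \phi(E_C, E_F)$$
$$\psi(D, I, A, B) = p(D \mid A, B, E_D) \, \phi(E_D, E_I)$$
$$\psi(F, I) = p(I \mid E_I) \, \phi(E_F, E_I)$$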


Unfortunately, the opposite of the last result above does not hold. That is, not 
every probability distribution that factorizes according to an AMP CG coincides 
with the marginal of a distribution that is Markovian wrt the corresponding 
EAMP CG. To see it, let $G$ be the AMP CG $A \to B - C$. Let $H$ be the EAMP CG 
corresponding to $G$, i.e. $E_A \to A \to B \leftarrow E_B - E_C \to C$. Consider a probability 
distribution $p(A, B, C, E_A, E_B, E_C)$ that is Markovian wrt $H$. Since as shown 
above $\{E_A, E_B, E_C\}$ is determined by $\{A, B, C\}$, Equation (7) implies that 

$$\frac{p(a_0, b^*_{a_0}, C)}{p(a_1, b^*_{a_1}, C)} = \frac{p(a_0 \mid E_A) \, p(b^*_{a_0} \mid a_0, E_B) \, p(C \mid E_C) \, \phi(E_A) \, \phi(E_B, E_C)}{p(a_1 \mid E_A) \, p(b^*_{a_1} \mid a_1, E_B) \, p(C \mid E_C) \, \phi(E_A) \, \phi(E_B, E_C)} = \frac{p(a_0 \mid E_A) \, \phi(E_A)}{p(a_1 \mid E_A) \, \phi(E_A)} \quad (8)$$

because both $\{a_0, b^*_{a_0}\}$ and $\{a_1, b^*_{a_1}\}$ determine that $E_B = 0$, which implies 
that $p(b^*_{a_0} \mid a_0, E_B) = p(b^*_{a_1} \mid a_1, E_B) = 1$. Now, consider a probability distribution 
$p'(A, B, C)$ that factorizes according to $G$. Then, Equation (1) implies that 

$$\frac{p'(a_0, b^*_{a_0}, C)}{p'(a_1, b^*_{a_1}, C)} = \frac{\psi(a_0) \, \psi(a_0, b^*_{a_0}, C)}{\psi(a_1) \, \psi(a_1, b^*_{a_1}, C)}. \quad (9)$$

Note that the ratio in Equation (9) is a function of $C$ whereas the ratio in Equation 
(8) is not. Therefore, $p(A, B, C) \neq p'(A, B, C)$ in general. 

that is Markovian wrt $H'$. Note that $E_X$ is a function of just $X$ in $p'$. If $X \in Z$, 
then let $H'$ have the edges $E_X \to X \leftarrow Y \leftarrow E_Y$, and let $p'$ be Markovian wrt $H'$ 
st $x^*_{y_0} \neq x^*_{y_1}$. Note that $E_X$ is a function of just $X \cup Y$ in $p'$. Note also that in 
either case $p'$ is Markovian wrt $H$, because $H'$ is a subgraph of $H$. Note also that 
in neither case $E_X$ is a function of $Z$ in $p'$. This contradicts that $Z$ determines $E_X$. 
- A node $X \in V$ is determined by some $Z \subseteq V$ iff $X \in Z$. The if part is trivial. To see 
the only if part, note that $X$ is determined by $Z$ only if $X \in Z$ or $E_X$ is determined 
by $Z$. However, $E_X$ is determined by $Z$ only if $X \in Z$ by the previous property. 









Finally, note that every node $X \in V$ in an EAMP CG $H$ forms a connectivity 
component on its own. Therefore, the factorization in Equation (7) is actually of 
the same form as the factorization in Equation (1). This comes as no surprise 
because, after all, $H$ is an AMP CG over $V \cup E$. 

7 Discussion 

We have addressed some issues that may hinder the use of AMP CGs in practice. 
We hope that the results reported in this paper help others to deploy AMP CGs 
in practical applications. Specifically, we have shown how a discrete probability 
distribution that is Markovian wrt an AMP CG factorizes according to it. We 
have also shown how this factorization makes it possible to perform inference 
and parameter learning efficiently. Finally, we have provided an intuitive inter¬ 
pretation of AMP CGs that sheds some light on what the different edges may 
mean. Unfortunately, the interpretation provided is not perfect, i.e. not every 
probability distribution that factorizes according to an AMP CG coincides with 
the marginal of a distribution that is Markovian wrt the corresponding EAMP 
CG. We are working to solve this problem. We are also working on proving the 
opposite of the result in Section 3, i.e. proving that every probability distribution 
that factorizes according to an AMP CG is Markovian wrt it. 
This note extends the original manuscript with new results, and corrects some errors. 


1. Factorization 


A probability distribution $p$ is Markovian wrt an AMP CG $G$ iff the following three properties 
hold for all $C \in Cc(G)$ (Andersson et al., 2001, Theorem 2): 

• C1: $C \perp_p Nd_G(C) \setminus Cc_G(Pa_G(C)) \mid Cc_G(Pa_G(C))$. 

• C2: $p(C \mid Cc_G(Pa_G(C)))$ is Markovian wrt $G_C$. 

• C3*: For all $D \subseteq C$, $D \perp_p Cc_G(Pa_G(C)) \setminus Pa_G(D) \mid Pa_G(D)$. 


Lemma 1. C1, C2 and C3* hold iff the following two properties hold: 

• C1*: For all $D \subseteq C$, $D \perp_p Nd_G(D) \setminus Pa_G(D) \mid Pa_G(D)$. 

• C2*: $p(C \mid Pa_G(C))$ is Markovian wrt $G_C$. 

Proof. First, C1* implies C3* by decomposition. Second, C1* implies C1 by taking $D = C$ and 
applying weak union. Third, C1 and the fact that $Nd_G(D) = Nd_G(C)$ imply $D \perp_p Nd_G(D) \setminus 
Cc_G(Pa_G(C)) \mid Cc_G(Pa_G(C))$ by symmetry and decomposition, which together with C3* imply C1* 
by contraction. Finally, C2 and C2* are equivalent because $p(C \mid Pa_G(C)) = p(C \mid Cc_G(Pa_G(C)))$ by 
C1* and decomposition. □ 

Given $C \in Cc(G)$ and $D \subseteq C$, we define the marginal graph $G_C^D$ as the undirected graph over $D$ 
st $X - Y$ is in $G_C^D$ iff $X - Y$ is in $G_C$ or $X - V_1 - \ldots - V_n - Y$ is in $G_C$ with $V_1, \ldots, V_n \notin D$. 


Lemma 2. Assume that $p$ is strictly positive and C1* holds. Then, C2* holds iff 

$$p(D \mid Pa_G(C)) = \prod_{K \in Cs(G_C^D)} \psi_D(K, Pa_G(K)) \quad (1)$$

for all $D \subseteq C$. 

Proof. To prove the if part, it suffices to take $D = C$ and note that $G_C^D = G_C$. Then, C2* holds 
(Lauritzen, 1996, Proposition 3.8). To prove the only if part, we adapt the proof of Theorem 3.9 
by Lauritzen (1996) to prove that $p(D \mid Pa_G(D))$ factorizes as indicated in Equation (1). This implies 
the desired result by C1* and decomposition. Specifically, choose arbitrary but fixed states $d^*$ and 
$pa_G(D)^*$ of $D$ and $Pa_G(D)$. Given $B \subseteq D$, let $\bar{b}^*$ and $\overline{pa_G(B)}^*$ denote the values of $D \setminus B$ and 
$Pa_G(D) \setminus Pa_G(B)$ consistent with $d^*$ and $pa_G(D)^*$. For all $B \subseteq D$, let 

$$H_D(b, pa_G(B)) = \log p(b, \bar{b}^* \mid pa_G(B), \overline{pa_G(B)}^*).$$

Note that using the logarithm is warranted by the assumption of $p$ being strictly positive. For all 
$K \subseteq D$, let 

$$\phi_D(k, pa_G(K)) = \sum_{B \subseteq K} (-1)^{|K \setminus B|} H_D(b, pa_G(B)) \quad (2)$$

where $b$ is consistent with $k$. Now, we can apply the Möbius inversion (Lauritzen, 1996, Lemma 
A.2) to obtain 

$$\log p(d \mid pa_G(D)) = H_D(d, pa_G(D)) = \sum_{K \subseteq D} \phi_D(k, pa_G(K))$$
where $k$ is consistent with $d$. Then, it only remains to prove that $\phi_D(k, pa_G(K))$ is zero whenever 
$K \notin Cs(G_C^D)$. Consider two nodes $S$ and $T$ of $K$ that are not adjacent in $G_C^D$. Then 

$$\phi_D(k, pa_G(K)) = \sum_{B \subseteq K \setminus ST} (-1)^{|K \setminus ST \setminus B|} \big[ H_D(b, pa_G(B)) - H_D(bs, pa_G(BS)) - H_D(bt, pa_G(BT)) + H_D(bst, pa_G(BST)) \big] \quad (3)$$

where $b$, $bs$, $bt$ and $bst$ are consistent with $k$. Note that $S \perp_p T \mid (D \setminus ST) Pa_G(C)$ by C2*, and 
$S \perp_p Pa_G(C) \setminus Pa_G(D) \mid (D \setminus ST) Pa_G(D)$ by C1*, symmetry, decomposition and weak union. Then, 
$S \perp_p T \mid (D \setminus ST) Pa_G(D)$ by contraction and decomposition. This together with Equation 3.7 by 
Lauritzen (1996) imply that 

$$H_D(bst, pa_G(BST)) - H_D(bs, pa_G(BS)) = \log \frac{p(bst, \overline{bst}^* \mid pa_G(BST), \overline{pa_G(BST)}^*)}{p(bs, \overline{bs}^* \mid pa_G(BS), \overline{pa_G(BS)}^*)}$$

$$= \log \frac{p(s \mid b, \overline{bst}^*, pa_G(BST), \overline{pa_G(BST)}^*) \; p(bt, \overline{bst}^* \mid pa_G(BST), \overline{pa_G(BST)}^*)}{p(s \mid b, \overline{bst}^*, pa_G(BS), \overline{pa_G(BS)}^*) \; p(b, \overline{bs}^* \mid pa_G(BS), \overline{pa_G(BS)}^*)}.$$

Moreover, note that $S \perp_p Pa_G(T) \setminus Pa_G(D \setminus T) \mid (D \setminus ST) Pa_G(D \setminus T)$ by C1*, symmetry, 
decomposition and weak union. This implies that 

$$H_D(bst, pa_G(BST)) - H_D(bs, pa_G(BS))$$

$$= \log \frac{p(s \mid b, \overline{bst}^*, pa_G(BS), \overline{pa_G(BST)}^*) \; p(bt, \overline{bst}^* \mid pa_G(BST), \overline{pa_G(BST)}^*)}{p(s \mid b, \overline{bst}^*, pa_G(BS), \overline{pa_G(BST)}^*) \; p(b, \overline{bs}^* \mid pa_G(BS), \overline{pa_G(BS)}^*)}$$

$$= \log \frac{p(s^* \mid b, \overline{bst}^*, pa_G(BT), \overline{pa_G(BT)}^*) \; p(bt, \overline{bst}^* \mid pa_G(BST), \overline{pa_G(BST)}^*)}{p(s^* \mid b, \overline{bst}^*, pa_G(BT), \overline{pa_G(BT)}^*) \; p(b, \overline{bs}^* \mid pa_G(BS), \overline{pa_G(BS)}^*)}.$$

Moreover, $S \perp_p Pa_G(T) \setminus Pa_G(D \setminus T) \mid (D \setminus ST) Pa_G(D \setminus T)$ also implies that 

$$H_D(bst, pa_G(BST)) - H_D(bs, pa_G(BS))$$

$$= \log \frac{p(s^* \mid b, \overline{bst}^*, pa_G(BT), \overline{pa_G(BT)}^*) \; p(bt, \overline{bst}^* \mid pa_G(BST), \overline{pa_G(BST)}^*)}{p(s^* \mid b, \overline{bst}^*, pa_G(B), \overline{pa_G(B)}^*) \; p(b, \overline{bs}^* \mid pa_G(BS), \overline{pa_G(BS)}^*)}.$$

Finally, note that $D \setminus S \perp_p Pa_G(S) \setminus Pa_G(D \setminus S) \mid Pa_G(D \setminus S)$ by C1* and decomposition. This 
implies that 

$$H_D(bst, pa_G(BST)) - H_D(bs, pa_G(BS))$$

$$= \log \frac{p(s^* \mid b, \overline{bst}^*, pa_G(BT), \overline{pa_G(BT)}^*) \; p(bt, \overline{bst}^* \mid pa_G(BT), \overline{pa_G(BT)}^*)}{p(s^* \mid b, \overline{bst}^*, pa_G(B), \overline{pa_G(B)}^*) \; p(b, \overline{bs}^* \mid pa_G(B), \overline{pa_G(B)}^*)}$$

$$= \log \frac{p(bt, \overline{bt}^* \mid pa_G(BT), \overline{pa_G(BT)}^*)}{p(b, \bar{b}^* \mid pa_G(B), \overline{pa_G(B)}^*)} = H_D(bt, pa_G(BT)) - H_D(b, pa_G(B)).$$

Thus, all the terms in the square brackets in Equation (3) add to zero, which implies that the entire 
sum is zero. □ 


It is customary to think of the factors $\psi_D(K, Pa_G(K))$ in Equation (1) as arbitrary non-negative 
functions, whose product needs to be normalized to result in a probability distribution. Note 
however that Equation (1) does not include any normalization constant. The reason is that the 
so-called canonical parameterization in Equation (2) permits us to write any probability distribution 
as a product of factors that does not need subsequent normalization. One might think that this 
must be an advantage for parameter estimation and inference. However, the truth is that the cost 
of computing the normalization constant has been replaced by the cost of having to manipulate a 
larger number of factors in Equation (1). To see it, note that the size of $Cs(G_C^D)$ is exponential in 
the size of the largest clique in $G_C^D$. 
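
The claim that the canonical parameterization needs no normalization can be checked numerically on a toy case. The sketch below is an illustration under simplifying assumptions that are not in the note: there are no parents (so the conditioning on $pa_G(\cdot)$ disappears), all variables are binary, and the sum runs over all subsets $K$ of the variables. The names `subsets` and `make_phi` are hypothetical.

```python
from itertools import combinations, product
from math import log

def subsets(items):
    """All subsets of a sequence, as tuples, including the empty one."""
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def make_phi(variables, p, reference):
    """Return phi(K, state) = sum over B subset of K of
    (-1)^{|K \\ B|} * log p(b, rest fixed at the reference state)."""
    index = {v: i for i, v in enumerate(variables)}

    def h(b_vars, state):
        # Variables in B take their value from `state`, the rest from `reference`.
        full = tuple(state[index[v]] if v in b_vars else reference[index[v]]
                     for v in variables)
        return log(p[full])

    def phi(k_vars, state):
        return sum((-1) ** (len(k_vars) - len(b)) * h(b, state)
                   for b in subsets(k_vars))

    return phi

# Hypothetical strictly positive distribution over three binary variables.
variables = ("A", "B", "C")
all_states = list(product([0, 1], repeat=3))
p = {s: w / 36 for s, w in zip(all_states, range(1, 9))}
phi = make_phi(variables, p, reference=(0, 0, 0))
```

Summing `phi(K, state)` over all subsets `K` recovers `log p[state]` exactly, which is the Möbius inversion step used in the proof of Lemma 2. The number of subsets grows exponentially with the number of variables, which is the cost mentioned above.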

A necessary and sufficient factorization follows. 

Theorem 1. Let $p$ be a strictly positive probability distribution. Then, $p$ is Markovian wrt an AMP 
CG $G$ iff 

$$p(V) = \prod_{C \in Cc(G)} p(C \mid Pa_G(C)) \quad (4)$$

with 

$$p(D \mid Pa_G(C)) = \prod_{K \in Cs(G_C^D)} \psi_D(K, Pa_G(K)) \quad (5)$$

for all $D \subseteq C$. 


Proof. The only if part holds because C1* and decomposition imply Equation (4), and Lemma 2 
implies Equation (5). To prove the if part, we prove that $p$ satisfies C1* and C2*. Note that $Nd_G(C) = 
Nd_G(D)$. This together with Equations (4) and (5) imply that 

$$p(D, Nd_G(D)) = p(D, Nd_G(C)) = \Big( \prod_{U \in Cc(G) : U \subseteq Nd_G(C)} p(U \mid Pa_G(U)) \Big) p(D \mid Pa_G(C)) = g(Nd_G(D)) \, h(D, Pa_G(D))$$

and thus C1* holds (Lauritzen, 1996, Equation 3.6). Finally, C2* holds by Equation (5) and Lemma 
2. □ 


A more convenient necessary and sufficient factorization follows. 

Theorem 2. Let $p$ be a strictly positive probability distribution. Then, $p$ is Markovian wrt an AMP 
CG $G$ iff 

$$p(V) = \prod_{C \in Cc(G)} p(C \mid Pa_G(C)) \quad (6)$$

with 

$$p(C \mid Pa_G(C)) = \prod_{K \in Cs(G_C)} \psi_C(K, Pa_G(K)) \quad (7)$$

and 

$$p(D \mid Pa_G(C)) = p(D \mid Pa_G(D)) \quad (8)$$

for all $D \subseteq C$. 

Proof. The only if part holds because C1* and decomposition imply Equations (6) and (8), and Lemma 
2 implies Equation (7). To prove the if part, we prove that $p$ satisfies C1* and C2*. Note that 
$Nd_G(C) = Nd_G(D)$. This together with Equations (6) and (8) imply that 

$$p(D, Nd_G(D)) = p(D, Nd_G(C)) = \Big( \prod_{U \in Cc(G) : U \subseteq Nd_G(C)} p(U \mid Pa_G(U)) \Big) p(D \mid Pa_G(C))$$

$$= \Big( \prod_{U \in Cc(G) : U \subseteq Nd_G(C)} p(U \mid Pa_G(U)) \Big) p(D \mid Pa_G(D)) = g(Nd_G(D)) \, h(D, Pa_G(D))$$

and thus C1* holds (Lauritzen, 1996, Equation 3.6). Finally, C2* holds by Equation (7) (Lauritzen, 
1996, Proposition 3.8). □ 


A necessary factorization that is more convenient for inference and parameter learning follows. 

Corollary 1. Let $p$ be a strictly positive probability distribution. If $p$ is Markovian wrt an AMP 
CG $G$, then 

$$p(V) = \prod_{C \in Cc(G)} p(C \mid Pa_G(C)) \quad (9)$$

with 

$$p(C \mid Pa_G(C)) = \prod_{K \in Cs(G_C)} \psi_C(K, Pa_G(K)). \quad (10)$$

2. Parameter Learning 


Given some data, we can efficiently obtain the maximum likelihood estimates of the fa ctors in 
Equation [10] by adapting the iterative proportional fitting procedure (IPFP) for MRFs (Murphy, 
2012i . Section 19.5.7) as follows: 


1 For each $C \in Cc(G)$
2 Set $p^0(C \mid Pa_G(C))$ to the uniform distribution
3 Compute $\phi_C(K, Pa_G(K))$ for all $K \in Cs(G_C)$ as shown in Equation [2]
4 Set $\psi_C(K, Pa_G(K)) = \exp \phi_C(K, Pa_G(K))$ for all $K \in Cs(G_C)$
5 Repeat until convergence
6 Set $\psi_C(K, Pa_G(K)) = \psi_C(K, Pa_G(K)) \, p_e(K \mid Pa_G(K)) / p(K \mid Pa_G(K))$ for all $K \in Cs(G_C)$

where $p_e$ is the empirical probability distribution over V obtained from the given data, and p is the
probability distribution over V due to the current estimates. Note that computing $p(K \mid Pa_G(K))$
requires inference. The multiplication and division in line 6 are elementwise. Existing gradient
ascent methods for MRFs can be adapted similarly.
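The update in line 6 can be sketched in a few lines of Python. This is only an illustration under several assumptions that are not in the text: the toy graph (A -> B, A -> C, A -> D and B - C - D) is hypothetical, only the maximal complete sets carry factors, the factors start at one instead of being computed via Equation [2], inference is done by brute-force enumeration, and the product of factors is normalized explicitly for each parent configuration. All function names are made up for the example.

```python
import itertools
import random

VARS = ("A", "B", "C", "D")  # binary variables of a hypothetical toy AMP CG
STATES = list(itertools.product((0, 1), repeat=len(VARS)))

# Assumed graph: A -> B, A -> C, A -> D and B - C - D.
# Each component lists its nodes and pairs (K, Pa_G(K)) for its maximal complete sets.
COMPONENTS = [
    {"nodes": ("A",), "sets": [(("A",), ())]},
    {"nodes": ("B", "C", "D"), "sets": [(("B", "C"), ("A",)), (("C", "D"), ("A",))]},
]

def proj(state, names):
    """Values that a full state (ordered as VARS) assigns to the variables in names."""
    return tuple(state[VARS.index(x)] for x in names)

def marginal(p, names):
    out = {}
    for state, prob in p.items():
        key = proj(state, names)
        out[key] = out.get(key, 0.0) + prob
    return out

def conditional(p, k, pa):
    """Table of p(k | pa), keyed by the values of k followed by the values of pa."""
    joint_kpa, marg_pa = marginal(p, k + pa), marginal(p, pa)
    return {key: val / marg_pa[key[len(k):]] for key, val in joint_kpa.items()}

def init_factors():
    """All-ones factors: after normalization every conditional starts uniform."""
    return {(k, pa): dict.fromkeys(itertools.product((0, 1), repeat=len(k + pa)), 1.0)
            for comp in COMPONENTS for k, pa in comp["sets"]}

def weight(factors, comp, state):
    """Unnormalized product of the factors of one component."""
    w = 1.0
    for k, pa in comp["sets"]:
        w *= factors[(k, pa)][proj(state, k + pa)]
    return w

def model_joint(factors):
    """Brute-force p(V): product over components of weight / normalization constant."""
    p = {}
    for state in STATES:
        prob = 1.0
        for comp in COMPONENTS:
            idx = [VARS.index(x) for x in comp["nodes"]]
            z = 0.0
            for vals in itertools.product((0, 1), repeat=len(idx)):
                other = list(state)
                for i, v in zip(idx, vals):
                    other[i] = v
                z += weight(factors, comp, tuple(other))
            prob *= weight(factors, comp, state) / z
        p[state] = prob
    return p

def ipfp(p_e, sweeps=3):
    factors = init_factors()
    for _ in range(sweeps):
        for (k, pa), table in factors.items():
            target = conditional(p_e, k, pa)
            current = conditional(model_joint(factors), k, pa)  # inference step
            for key in table:
                table[key] *= target[key] / current[key]  # line 6, elementwise
    return factors

rng = random.Random(0)
raw = [rng.uniform(0.5, 1.5) for _ in STATES]
p_e = {s: w / sum(raw) for s, w in zip(STATES, raw)}  # stand-in for empirical frequencies
fitted = model_joint(ipfp(p_e))
```

In this toy graph every complete set has the same parents as its component, so each update reduces to classical iterative proportional fitting conditioned on A, and the fitted conditionals of the complete sets should match the empirical ones. The brute-force `model_joint` is exponential in the number of variables and only stands in for a proper inference routine.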

We justify the algorithm above by adapting some existing results for MRFs. We temporarily drop
the assumption that the product of factors in Equation [10] is normalized, and replace it with


\[
p(C \mid Pa_G(C)) = \frac{1}{Z_C(Pa_G(C))} \prod_{K \in Cs(G_C)} \psi_C(K, Pa_G(K)) \tag{11}
\]

where

\[
Z_C(Pa_G(C)) = \sum_{c} \prod_{K \in Cs(G_C)} \psi_C(k, Pa_G(K))
\]

where k is consistent with c. Let $\psi$ denote all the factors due to Equations [9] and [11]. Then, the
log-likelihood function is

\[
l(\psi) = \sum_{C \in Cc(G)} \Bigg( \sum_{K \in Cs(G_C)} \sum_{k} \sum_{pa_G(K)} n(k, pa_G(K)) \log \psi_C(k, pa_G(K)) - \sum_{pa_G(C)} n(pa_G(C)) \log Z_C(pa_G(C)) \Bigg)
\]

where $n(k, pa_G(K))$ is the number of instances in the data where K and $Pa_G(K)$ take values k and
$pa_G(K)$ simultaneously. Similarly for $n(pa_G(C))$. Dividing both sides by the number of instances
in the data, n, we have that

\[
l(\psi)/n = \sum_{C \in Cc(G)} \Bigg( \sum_{K \in Cs(G_C)} \sum_{k} \sum_{pa_G(K)} p_e(k, pa_G(K)) \log \psi_C(k, pa_G(K)) - \sum_{pa_G(C)} p_e(pa_G(C)) \log Z_C(pa_G(C)) \Bigg)
\]
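As a sanity check of the count-based form of the log-likelihood, the following Python sketch evaluates the term of a single component in two ways: instance by instance through Equation [11], and through the counts. The toy graph A -> B - C, the random factors, the random data and all function names are assumptions made for the example, and only the term of the component {B, C} is computed.

```python
import collections
import itertools
import math
import random

# Hypothetical toy AMP CG A -> B - C with binary variables; states are (a, b, c).
# Pairs (K, Pa_G(K)) for the nonempty complete sets of the component {B, C}.
IDX = {"A": 0, "B": 1, "C": 2}
SCOPES = [(("B",), ("A",)), (("C",), ()), (("B", "C"), ("A",))]

def proj(state, names):
    return tuple(state[IDX[x]] for x in names)

rng = random.Random(1)
# Arbitrary positive factors psi(k, pa_G(K)); their product is not normalized.
factors = {(k, pa): {vals: rng.uniform(0.5, 2.0)
                     for vals in itertools.product((0, 1), repeat=len(k + pa))}
           for k, pa in SCOPES}

def weight(state):
    w = 1.0
    for k, pa in SCOPES:
        w *= factors[(k, pa)][proj(state, k + pa)]
    return w

def log_z(a):
    """Log of the normalization constant of the component {B, C} given its parent A = a."""
    return math.log(sum(weight((a, b, c)) for b in (0, 1) for c in (0, 1)))

data = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(200)]

def loglik_direct(data):
    """Sum over instances of log p(b, c | a), with p normalized as in Equation [11]."""
    return sum(math.log(weight(s)) - log_z(s[0]) for s in data)

def loglik_counts(data):
    """Same quantity computed from the counts n(k, pa_G(K)) and n(pa_G(C))."""
    total = 0.0
    for k, pa in SCOPES:
        counts = collections.Counter(proj(s, k + pa) for s in data)
        total += sum(n * math.log(factors[(k, pa)][key]) for key, n in counts.items())
    parent_counts = collections.Counter(s[0] for s in data)
    total -= sum(n * log_z(a) for a, n in parent_counts.items())
    return total
```

The two functions should agree up to floating point error, which reflects that the log-likelihood depends on the data only through the counts of the complete sets with their parents and the counts of the parents of each component.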


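Let $U \in Cc(G)$ and $Q \in Cs(G_U)$. The gradient of $l(\psi)/n$ wrt $\psi_U(q, pa_G(Q))$ is

\[
\frac{\partial \, l(\psi)/n}{\partial \psi_U(q, pa_G(Q))} = \frac{p_e(q, pa_G(Q))}{\psi_U(q, pa_G(Q))} - \sum_{pa_G(U)} \frac{p_e(pa_G(U))}{Z_U(pa_G(U))} \frac{\partial Z_U(pa_G(U))}{\partial \psi_U(q, pa_G(Q))}
\]

where the sum ranges over the values $pa_G(U)$ that are consistent with $pa_G(Q)$. Let $W = U \setminus Q$. Then

\begin{align*}
\frac{\partial Z_U(pa_G(U))}{\partial \psi_U(q, pa_G(Q))}
&= \sum_{w} \prod_{K \in Cs(G_U) \setminus Q} \psi_U(k, \bar{k}, pa_G(K)) \\
&= \frac{Z_U(pa_G(U))}{\psi_U(q, pa_G(Q))} \sum_{w} \Bigg( \prod_{K \in Cs(G_U) \setminus Q} \psi_U(k, \bar{k}, pa_G(K)) \Bigg) \frac{\psi_U(q, pa_G(Q))}{Z_U(pa_G(U))} \\
&= \frac{Z_U(pa_G(U))}{\psi_U(q, pa_G(Q))} \, p(q \mid pa_G(U))
\end{align*}

where $\bar{k}$ denotes the elements of q corresponding to the elements of $K \cap Q$, and the last equality
follows from Equation [11]. Note also that $p(q \mid pa_G(Q)) = p(q \mid pa_G(U))$ by C1* and decomposition.
Putting together the results above, and noting that summing $p_e(pa_G(U))$ over the values consistent
with $pa_G(Q)$ gives $p_e(pa_G(Q))$, we have that

\[
\frac{\partial \, l(\psi)/n}{\partial \psi_U(q, pa_G(Q))} = \frac{p_e(q, pa_G(Q))}{\psi_U(q, pa_G(Q))} - \frac{p_e(pa_G(Q)) \, p(q \mid pa_G(Q))}{\psi_U(q, pa_G(Q))}
\]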


Since the maximum likelihood estimates are obtained when the gradient is 0 for all the entries of all 
the factors, we have that the maximum likelihood estimates are obtained when 


\[
\psi_C(k, pa_G(K)) = \psi_C(k, pa_G(K)) \, \frac{p_e(k \mid pa_G(K))}{p(k \mid pa_G(K))}
\]

for all $C \in Cc(G)$ and $K \in Cs(G_C)$. This justifies the updating step in line 6 of the IPFP.
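As a brief worked step, and assuming that $p_e(pa_G(Q)) > 0$ and that the factors are strictly positive, setting the gradient to zero shows why the ratio in the updating step equals one at a stationary point, so that the update leaves the factors unchanged there:

```latex
\begin{align*}
0 &= \frac{p_e(q, pa_G(Q)) - p_e(pa_G(Q)) \, p(q \mid pa_G(Q))}{\psi_U(q, pa_G(Q))} \\
\Longleftrightarrow \quad p(q \mid pa_G(Q)) &= \frac{p_e(q, pa_G(Q))}{p_e(pa_G(Q))} = p_e(q \mid pa_G(Q)) \\
\Longleftrightarrow \quad \frac{p_e(q \mid pa_G(Q))}{p(q \mid pa_G(Q))} &= 1 .
\end{align*}
```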

Let the factor updated in the current iteration have the superscript t + 1, whereas the rest of the
factors have the superscript t. Next, we show that if $Z_C^t(Pa_G(C)) = 1$ then $Z_C^{t+1}(Pa_G(C)) = 1$.
This implies that Equations [7] and [11] are equivalent because $Z_C^0(Pa_G(C)) = 1$ by line 3 and, thus,
our assumption of Equation [11] is innocuous. To see it, let $U \in Cc(G)$, $Q \in Cs(G_U)$ and $W = U \setminus Q$.
Then

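\begin{align*}
p^{t+1}(Q \mid Pa_G(U)) &= \sum_{w} p^{t+1}(Q, w \mid Pa_G(U)) \\
&= \sum_{w} \psi_U^{t+1}(Q, Pa_G(Q)) \, \frac{1}{Z_U^{t+1}(Pa_G(U))} \prod_{K \in Cs(G_U) \setminus Q} \psi_U^{t}(k, Pa_G(K)) \\
&= \sum_{w} \psi_U^{t}(Q, Pa_G(Q)) \, \frac{p_e(Q \mid Pa_G(Q))}{p^{t}(Q \mid Pa_G(Q))} \, \frac{1}{Z_U^{t+1}(Pa_G(U))} \prod_{K \in Cs(G_U) \setminus Q} \psi_U^{t}(k, Pa_G(K)) \\
&= \frac{p_e(Q \mid Pa_G(Q))}{p^{t}(Q \mid Pa_G(Q))} \, \frac{Z_U^{t}(Pa_G(U))}{Z_U^{t+1}(Pa_G(U))} \sum_{w} p^{t}(Q, w \mid Pa_G(U)) \\
&= \frac{p_e(Q \mid Pa_G(Q))}{p^{t}(Q \mid Pa_G(Q))} \, \frac{Z_U^{t}(Pa_G(U))}{Z_U^{t+1}(Pa_G(U))} \, p^{t}(Q \mid Pa_G(U)) \\
&= p_e(Q \mid Pa_G(Q)) \, \frac{Z_U^{t}(Pa_G(U))}{Z_U^{t+1}(Pa_G(U))}
\end{align*}

since $p^{t}(Q \mid Pa_G(Q)) = p^{t}(Q \mid Pa_G(U))$ by C1* and decomposition. Summing both sides over q implies
that the IPFP preserves the normalization constant across iterations.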
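The preservation of the normalization constant can be checked numerically on a small example. The Python sketch below is only illustrative and rests on assumptions that are not in the text: a hypothetical component U = {B, C} with the edge B - C and with A as a parent of both nodes, so that $Pa_G(Q) = Pa_G(U)$ for Q = {B} and the equality used in the last step holds trivially. The factors and the target conditional are random, and all names are made up.

```python
import itertools
import random

# Hypothetical component U = {B, C} with B - C and parent A of both B and C,
# so Pa_G(Q) = Pa_G(U) = {A} for Q = {B}. All variables are binary.
rng = random.Random(3)
PAIRS = list(itertools.product((0, 1), repeat=2))
TRIPLES = list(itertools.product((0, 1), repeat=3))
psi_b = {(b, a): rng.uniform(0.5, 2.0) for b, a in PAIRS}
psi_c = {(c, a): rng.uniform(0.5, 2.0) for c, a in PAIRS}
psi_bc = {(b, c, a): rng.uniform(0.5, 2.0) for b, c, a in TRIPLES}

def weight(a, b, c, psi_q):
    """Unnormalized product of the factors of U, with psi_q as the factor of Q = {B}."""
    return psi_q[(b, a)] * psi_c[(c, a)] * psi_bc[(b, c, a)]

def z(a, psi_q):
    """Normalization constant of U for the parent value a."""
    return sum(weight(a, b, c, psi_q) for b, c in PAIRS)

def p_b_given_a(b, a, psi_q):
    return sum(weight(a, b, c, psi_q) for c in (0, 1)) / z(a, psi_q)

# Arbitrary positive target p_e(B | A); for each a the two entries sum to one.
p_e = {}
for a in (0, 1):
    t = rng.uniform(0.1, 0.9)
    p_e[(1, a)], p_e[(0, a)] = t, 1.0 - t

# Updating step of the IPFP applied to the factor of Q = {B}.
psi_b_new = {(b, a): psi_b[(b, a)] * p_e[(b, a)] / p_b_given_a(b, a, psi_b)
             for b, a in PAIRS}
```

Here the constant is not equal to one before the update, yet `z(a, psi_b_new)` should coincide with `z(a, psi_b)` for each value of a, which is the invariance that the derivation above relies on. After the update, the conditional of B given A under the new factors should equal the target.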

3. Discussion 

Given a probability distribution p that is Markovian wrt an AMP CG G, we have described in
Equations [6]-[8] necessary and sufficient conditions for p to factorize wrt G. This note extends the
original manuscript, where p is shown to factorize as

\[
p(V) = \prod_{C \in Cc(G)} \prod_{K \in Cs(G_C)} \psi_C(K, Pa_G(K)).
\]

To see that the condition above is necessary but not sufficient, consider the AMP CGs $A \to B - C$
and $A \to B - C \leftarrow A$, and note that both imply the same factorization, namely
$p(A, B, C) = \psi_A(A) \, \psi_{BC}(A, B, C)$. So, if p encodes no independence then it factorizes according to both CGs
although it is Markovian wrt only the second of them. In any case, the factorization above is enough
to perform inference and parameter estimation efficiently, as shown in the original manuscript.
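The example can be made concrete with a short Python sketch. The distribution below is a hypothetical one chosen for the illustration, in which A and C are dependent. It factorizes as $\psi_A(A) \, \psi_{BC}(A, B, C)$ with $\psi_A(A) = p(A)$ and $\psi_{BC}(A, B, C) = p(B, C \mid A)$, but it violates the independence between A and C that the first CG represents (C1* with D = {C}, since C has no parents in that CG).

```python
import itertools

STATES = list(itertools.product((0, 1), repeat=3))  # states (a, b, c)

def p(a, b, c):
    """A hypothetical strictly positive distribution in which A and C are dependent."""
    prob_a = 0.5
    prob_c_given_a = 0.9 if c == a else 0.1
    prob_b1 = 0.2 + 0.3 * a + 0.3 * c  # p(B = 1 | a, c)
    return prob_a * prob_c_given_a * (prob_b1 if b == 1 else 1.0 - prob_b1)

def p_a(a):
    return sum(p(a, b, c) for b in (0, 1) for c in (0, 1))

def p_c(c):
    return sum(p(a, b, c) for a in (0, 1) for b in (0, 1))

def p_ac(a, c):
    return sum(p(a, b, c) for b in (0, 1))

# Factors of the shared factorization: psi_A(A) = p(A) and psi_BC(A, B, C) = p(B, C | A).
def psi_a(a):
    return p_a(a)

def psi_bc(a, b, c):
    return p(a, b, c) / p_a(a)

factorizes = all(abs(psi_a(a) * psi_bc(a, b, c) - p(a, b, c)) < 1e-12
                 for a, b, c in STATES)
# Nonzero, so A and C are dependent and p is not Markovian wrt A -> B - C.
dependence_gap = abs(p_ac(1, 1) - p_a(1) * p_c(1))
```

For this choice, p(A = 1, C = 1) = 0.45 whereas p(A = 1) p(C = 1) = 0.25, so the factorization holds while the independence fails.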

Unfortunately, finding the maximum likelihood estimates of the factors in the new factorization
is difficult and, thus, we have decided to enforce only Equations [6] and [7] in the estimation process
so that it can be performed efficiently via the IPFP. The so fitted factorization is enough to perform
inference efficiently following the same procedure as in the original manuscript.

Our work is related to that by Drton (2008), where the author proposes necessary and sufficient
conditions for p to factorize wrt G when G is a MVR CG. His factorization resembles ours in that
it includes constraints similar to those in Equation [7], which make maximum likelihood estimation
hard. To overcome this problem, the author develops a so-called iterative conditional fitting
procedure (ICFP) that, at each iteration, solves a convex optimization problem under the mentioned
constraints. We plan to study whether it is possible to adapt the ICFP to our problem, given the
similarity between the constraints in both factorizations. Drton (2008) also makes the interesting
observation that the runtime of the ICFP can be shortened by replacing G with a Markov equivalent
CG with smaller connectivity components. It would be interesting to see whether this also applies
to our IPFP. A result that would be helpful in that investigation is that by Sonntag and Peña (2015,
Theorem 4), which shows how to obtain a Markov equivalent CG with the fewest undirected edges.

Two other works that are related to ours are those by Abbeel et al. (2006) and Roy et al. (2009).
Unlike our work, these works do not characterize when the Markovian and factorization properties
are equivalent. Instead, they develop closed form expressions for estimating the factors in the
factorization of p wrt G when G is a factor graph. Since factor graphs subsume AMP CGs, we can
adapt their closed form estimates to our problem. Specifically, let $C \in Cc(G)$ and $K \in Cs(G_C)$. Also,
let $Mb_G(K)$ denote the minimal subset of $C \, Pa_G(C)$ st $K \perp_G C \, Pa_G(C) \setminus K \, Mb_G(K) \mid Mb_G(K)$. It
is easy to see that $Mb_G(K) = Ne_G(K) \, Pa_G(K) \, Pa_G(Ne_G(K))$. Now, choose an arbitrary but fixed
state $v^*$ of V. Then, we can use Proposition 4 by Abbeel et al. (2006) to rewrite the factorization
in Corollary 1 as follows (details omitted):

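\[
p(v) = p(v^*) \prod_{C \in Cc(G)} \prod_{K \in Cs(G_C) : K \neq \emptyset} \psi(k) \tag{12}
\]

where k is consistent with v, and

\[
\psi(k) = \exp \Bigg( \sum_{B \subseteq K} (-1)^{|K \setminus B|} \log p(b, \bar{b}^* \mid mb_G(K)^*) \Bigg) \tag{13}
\]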


where $\bar{b}^*$ and $mb_G(K)^*$ denote the values of $K \setminus B$ and $Mb_G(K)$ consistent with $v^*$. In order to
estimate the factors above, the authors propose replacing p with the empirical probability distribution
$p_e$. Unfortunately, this may produce an unreliable estimate for $p(v^*)$ unless the data available is
abundant. Similarly for $p(b, \bar{b}^* \mid mb_G(K)^*)$ because K and $Mb_G(K)$ may be large. Note also that
the estimate of $p(b, \bar{b}^* \mid mb_G(K)^*)$ is based only on the instances of the data that are consistent with
b and $mb_G(K)^*$ simultaneously. This means that the data available is not used efficiently. All
this leads the authors to acknowledge that their closed form estimates, as described, are probably
impractical (Abbeel et al., 2006, p. 1764). Roy et al. (2009) improve the method above by
simplifying Equation [13]. Although the improvement alleviates the drawbacks mentioned, it does not
eliminate them completely (e.g. Equation [12] stays the same and, thus, the problem of estimating
$p(v^*)$ remains). Unfortunately, no experimental results are reported in either of the works cited. It
would be interesting to compare them with our IPFP.
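Equations [12] and [13] can be checked numerically in a simple special case. The Python sketch below assumes a single component without parents, namely the Markov network X - Y - Z, where $Mb_G(K)$ reduces to the neighbours of K outside K. It uses the exact distribution p rather than the empirical one, so it illustrates the identity and not the estimation issues discussed above. The potentials, the reference state and all names are hypothetical.

```python
import itertools
import math
import random

# Special case assumed for illustration: one component without parents, i.e. the
# Markov network X - Y - Z, where Mb_G(K) is the set of neighbours of K outside K.
VARS = ("X", "Y", "Z")
IDX = {name: i for i, name in enumerate(VARS)}
EDGES = [("X", "Y"), ("Y", "Z")]
COMPLETE_SETS = [("X",), ("Y",), ("Z",), ("X", "Y"), ("Y", "Z")]  # nonempty ones
STATES = list(itertools.product((0, 1), repeat=3))
V_STAR = (0, 0, 0)  # arbitrary but fixed reference state v*

rng = random.Random(2)
pot = {e: {vals: rng.uniform(0.5, 2.0) for vals in itertools.product((0, 1), repeat=2)}
       for e in EDGES}
raw = {s: math.prod(pot[e][(s[IDX[e[0]]], s[IDX[e[1]]])] for e in EDGES) for s in STATES}
total = sum(raw.values())
p = {s: w / total for s, w in raw.items()}  # strictly positive, Markovian wrt the chain

def proj(state, names):
    return tuple(state[IDX[x]] for x in names)

def markov_blanket(k):
    return tuple(v for v in VARS if v not in k
                 and any((v, u) in EDGES or (u, v) in EDGES for u in k))

def cond_prob(k, k_vals, mb):
    """p(K = k_vals | Mb(K) = mb*), with mb* read off from V_STAR."""
    match_mb = [s for s in STATES if proj(s, mb) == proj(V_STAR, mb)]
    den = sum(p[s] for s in match_mb)
    num = sum(p[s] for s in match_mb if proj(s, k) == k_vals)
    return num / den

def canonical_factor(k, k_vals):
    """Equation [13]: alternating sum over the subsets B of K."""
    mb, log_psi = markov_blanket(k), 0.0
    for size in range(len(k) + 1):
        for b in itertools.combinations(range(len(k)), size):
            mixed = tuple(k_vals[i] if i in b else V_STAR[IDX[k[i]]]
                          for i in range(len(k)))
            log_psi += (-1) ** (len(k) - size) * math.log(cond_prob(k, mixed, mb))
    return math.exp(log_psi)

def reconstruct(v):
    """Right-hand side of Equation [12]."""
    return p[V_STAR] * math.prod(canonical_factor(k, proj(v, k)) for k in COMPLETE_SETS)
```

In this setting `reconstruct(v)` should recover `p[v]` for every state v. Replacing p with frequencies from a small sample would make each call to `cond_prob` depend on only the few instances that match the reference values, which is the inefficiency noted above.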